A Relevance feedback based approach for mixed script transliterated text search: Shared Task report by BIT Mesra, India

Authors

  • Amit Prakash
  • Sujan Kumar Saha
Abstract

This paper describes the experiments carried out as part of our participation in the FIRE-2014 Transliterated Search Shared Task. We participated in subtask 2 and submitted two runs generated by systems based on a relevance feedback approach. Given a collection of documents in mixed script, the task is to retrieve relevant documents using queries in either script. Spelling variation between different transliterated versions of the same text causes a query mismatch problem, which makes the task quite challenging. The basic idea behind our approach is the observation that short words show very little spelling variation when transliterated into a non-native script. We propose two n-gram approaches for query expansion, based on the number of characters in the n-gram and on how frequently the n-grams occur. Using relevance feedback, we obtained a 14% gain in NDCG@1 and a 10% improvement in MRR. The results suggest that our approach is helpful in retrieving relevant documents from a mixed-script document collection.
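The abstract does not give the details of the two n-gram expansion methods, so the following is only a minimal sketch of the general idea: tokens from feedback documents that share enough character n-grams with a query term are treated as likely spelling variants and added to the query, preferring frequent ones. The function names `char_ngrams` and `expand_query`, the Jaccard overlap measure, and the parameters `n`, `min_overlap`, and `top_k` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the set of character n-grams of a word.

    Words no longer than n are kept whole as a single gram.
    """
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def expand_query(query_terms, feedback_docs, n=2, min_overlap=0.5, top_k=2):
    """Add likely spelling variants from feedback documents to the query.

    Assumption: a token is a variant of a query term when the Jaccard
    overlap of their character n-gram sets reaches min_overlap. Among the
    candidates, the most frequent tokens in the feedback set are preferred.
    """
    # Token frequencies over the (assumed relevant) feedback documents.
    counts = Counter(tok for doc in feedback_docs for tok in doc.lower().split())
    expanded = list(query_terms)
    for term in query_terms:
        term_lc = term.lower()
        term_grams = char_ngrams(term_lc, n)
        candidates = []
        for tok, freq in counts.items():
            if tok == term_lc:
                continue
            tok_grams = char_ngrams(tok, n)
            overlap = len(term_grams & tok_grams) / len(term_grams | tok_grams)
            if overlap >= min_overlap:
                candidates.append((freq, overlap, tok))
        # Most frequent first, then highest overlap, then alphabetical.
        candidates.sort(key=lambda c: (-c[0], -c[1], c[2]))
        for _, _, tok in candidates[:top_k]:
            if tok not in expanded:
                expanded.append(tok)
    return expanded


# Hypothetical feedback documents containing a variant spelling of "pyaar".
docs = ["tum se pyar hai", "pyar kiya to darna kya", "pyaar ka nagma"]
expanded_query = expand_query(["pyaar"], docs)
```

In this toy example the variant "pyar" shares three of the four bigrams of "pyaar", so it passes the overlap threshold, while unrelated tokens such as "kiya" or "darna" do not. A real system would take the feedback documents from an initial retrieval run and would need tuned thresholds.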

Similar resources

Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra

This paper describes an approach for Mixed-script Ad hoc retrieval, a subtask of the FIRE 2015 Shared Task on Mixed Script Information Retrieval. We participated in subtask 2 of the shared task, where a statistical model was used to carry out back transliteration to Devanagari script. To perform the search, a bigram-based index of the documents was used and search was performed using pivot t...

Query Labelling for Indic Languages using a hybrid approach

With the boom of the internet, social media text has been increasing day by day. Much of the user-generated content on the internet is written in a very informal way. Usually people tend to write text on social media using indigenous script. To understand a script different from ours is a difficult task. Moreover, nowadays queries received by the search engines are large number of transliterated text...

Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval

The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astr...

DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval

This paper describes the methodology followed by Team Watchdogs in their submission for the shared task on Mixed Script Information Retrieval (MSIR) in FIRE 2015. I participated in subtasks 1 (Query Word Labelling) and 2 (Mixed-script Ad hoc retrieval). For subtask 1, a machine learning approach using a CRF classifier was used to classify the tokens as one of the possible languages using ...

Encoding transliteration variation through dimensionality reduction: FIRE Shared Task on Transliterated Search

There exists a large amount of user-generated Web content in Roman script for languages that are, for various reasons, normally written in indigenous scripts. In light of this phenomenon, search engines face the non-trivial problem of matching queries and documents in transliterated space, where transliterated content contains extensive spelling variation. This paper describes our proposed method ...

Publication date: 2014